Self Consistency: A General Recipe for Wavelet Estimation With 
Irregularly-spaced and/or Incomplete Data* 



Thomas C. M. Leet 
The Chinese University of Hong Kong/ 
Colorado State University 



Xiao-Li Meng* 
Harvard University 



January 6, 2006 



Abstract 



Inspired by the key principle behind the EM algorithm, we propose a general methodology for 
conducting wavelet estimation with irregularly-spaced data by viewing the data as the observed 
portion of an augmented regularly-spaced data set. We then invoke the self-consistency principle 
to define our wavelet estimators in the presence of incomplete data. Major advantages of this 
approach include: (i) it can be coupled with almost any wavelet shrinkage methods, (ii) it can 
deal with non-Gaussian or correlated noise, and (iii) it can automatically handle other kinds of 
missing or incomplete observations. We also develop a multiple-imputation algorithm and fast 
EM-type algorithms for computing or approximating such estimates. Results from numerical 
experiments suggest that our algorithms produce favorite results when comparing to several 
common methods, and therefore we hope these empirical findings would motivate subsequent 
theoretical investigations. To illustrate the flexibility of our approach, examples with Poisson 
data smoothing and image denoising are also provided. 

Key words and phrases: EM algorithm, image denoising, non-equispaced data, missing data, 
multiscale methods, non-parametric regression, wavelet thresholding. 

1 A Brief Literature Review and Overview 

Since the early 1990s, wavelet techniques, especially for nonparametric regressions and signal de- 
noising, have attracted enormous attention from researchers across different fields. Two major 
reasons for this are that wavelet estimators enjoy excellent theoretical properties and that they 
are capable of adapting simultaneously to spatial and frequency inhomogeneities (e.g., see Donoho 
& Johnstone 1994, Donoho & Johnstone 1995 and Donoho, Johnstone, Kerkyacharian & Picard 
1995). Also, they are backed up by a fast algorithm (e.g., see Mallat 1989). 

A restrictive vanilla setting is as follows. We have N = 2 J observations yi satisfying 



and our goal is to estimate the regression function or signal / via a wavelet method. This usually 
consists of three steps. The first step is to compute the empirical wavelet coefficient vector w by 
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applying a discrete wavelet transform (DWT) to y = (yo, . . . ,Hn-i) T \ that is, if W denotes the 
DWT matrix, then w is given by w = Wy. The second step is to apply a shrinkage operation 
(e.g., thresholding) to w to obtain an estimated wavelet coefficient vector w. Finally, / = W T w 
is computed as the wavelet estimate or reconstruction of / = (/(0), . . . , f( N ^ )) T ■ Details can 
be found in numerous monographs, such as Daubechies (1992), Ogden (1996), Mallet (1999), and 
Vidakovic (1999). 

The most popular shrinkage operation in the statistical literature appears to be thresholding. 
Earliest examples include the "universal" thresholding method of Donoho & Johnstone (1994), 
the SURE thresholding method of Donoho & Johnstone (1995), and the method of Saito (1994) 
that uses the minimum description length (MDL) principle of Rissanen (1989). Since then many 
different thresholding methods have also been proposed, including the cross-validation method of 
Nason (1996), the refined MDL based methods of Moulin (1996) and of Lee (2002), the cross- 
validatory AIC method of Hurvich & Tsai (1998), and the method of Crouse, Nowak & Baraniuk 
(1998) that uses hidden Markov models. Moreover, various Bayesian and empirical Bayes methods 
have also been proposed; e.g., see Chipman, Kolaczyk & McCulloch (1997), Abramovich, Sapatinas 
& Silverman (1998), Clyde, Parmigiani & Vidakovic (1998), Vidakovic (1998), Clyde & George 
(1999, 2000), and Johnstone & Silverman (2005). In addition, some treatment has also been given 
to the issue of correlated noise, see, for example, Wang (1996), Johnstone & Silverman (1997), 
Jansen & Bultheel (1999) and Lee (2002). Lastly, robust wavelet smoothing has been considered 
for examples by Sardy, Tseng & Bruce (2001) and Oh, Nychka & Lee (2005), while methods for 
reducing boundary artifacts are studied by Lee & Oh (2004) and Oh & Lee (2005). 

However, the applicability of many existing wavelet regression methods is limited by the as- 
sumptions that the data are not only complete but also equispaced and that the number of design 
points is an integer power of 2, as in (1). Different methods have been proposed to relax the 
equal-spaced assumption. Hall & Turlach (1997) proposed the uses of two interpolation rules for 
mapping the observed data into a regular grid. They provided a detailed theoretical analysis of 
their proposals and an asymptotic choice of the thresholding value. Kovac & Silverman (2000) 
also investigated the use of interpolation. They developed a fast algorithm for computing the noise 
covariance structure after the original observed data are mapped into an interpolation grid. With 
the knowledge of this new noise covariance structure, tailored thresholding values can be derived. 
Nason (2002) demonstrated that this fast covariance updating algorithm is extremely useful for 
speeding up cross-validation type calculations. Interpolation based methods were also studied in 
Sardy, Percival, Bruce, Gao & Stuetzle (1999). Cai & Brown (1998) took a different approach by 
invoking distributional assumptions for the design points. More recently, Antoniadis & Fan (2001) 
consider the use of penalized least squares for handling non-equally spaced data. In their procedure 
the best curve estimate is defined as the minimizer of a regularized least squares criterion. Start- 
ing with a regularized wavelet interpolation of the non-equally spaced observed data, a one-step 
iterative algorithm is applied to approximate the minimizer. Finally, the method of lifting can be 
applied to construct wavelets for irregularly spaced data (e.g., see Delouille, Simoens & von Sachs 
2004 and Nunes, Knight & Nason 2006). However, thresholding methods for such lifting wavelet 
bases seem to be less developed. 

Whereas most of these non-equispaced methods are effective for various applications, they 
impose some additional assumptions that are not used with regular designs; e.g., the implicit 
smoothness assumption in interpolation methods. In this article we demonstrate that it is possible 
to deal with the irregular design problem without making such assumptions. Our key idea is to 
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view an irregular design as a regular design with missing observations. This view allows us to 
utilize those well-known and widely tested methods in the extensive literature on estimation and 
computation for missing data, in particular EM-type algorithms (e.g., Dempster, Laird & Rubin 
1977, Meng & Rubin 1993 and Meng & van Dyk 1997) and multiple imputation (e.g., Rubin 1987 
and Meng 1994). We invoke a self-consistency criterion, in the same spirit as in Efron (1967) 
and as the self-consistency principle underlying the EM algorithm, to define the wavelet regression 
estimator with incomplete data. 

Our approach is that, given only a complete-data procedure and the corresponding model on 
the missing-data mechanism, we first seek the most efficient wavelet estimator for the values of the 
regression function at the observed designed points; i.e., design points at which the response y is ob- 
served. We then incorporate into such "optimal" estimation procedure any additional assumptions 
that are not used with the complete procedure, provided that such assumptions can further improve 
the efficiency for any particular applications. This separation between the inherited information in 
the observed data and the information built into a procedure due to external assumptions helps to 
obtain more efficient wavelet regression estimates with irregular designs. This is not surprising as 
sensible self-consistent procedures often lead to the most efficient estimators, both in the paramet- 
ric estimation (e.g., maximum likelihood estimation via the EM algorithm) and the nonparametric 
estimation (e.g., the Kaplan-Meier estimator) contexts. Indeed, as Tarpey & Flury (1996) put 
it, self-consistency is a fundamental concept in statistics, and is a general statistical principle for 
retaining as much as possible the information in the data. In addition, as demonstrated below, 
another advantage of this approach is that it can be straightforwardly extended to more compli- 
cated settings, such as image denoising and non-Gaussian errors, and it handles non-equispaced 
data simultaneously with other kind of incomplete data, such as in photo inpainting applications 
(see Section 5). 

The rest of this article is organized as follows. Section 2 reviews the self-consistency principle. 
It also presents a self-consistency criterion for regression function estimation, from which our non- 
equispaced wavelet estimator is implicitly defined. In Section 3 three algorithms are constructed 
to compute or approximate this wavelet estimator. Further extensions including two-dimensional 
implementation are discussed in Section 4, and simulation results are reported in Section 5. Future 
work, especially regarding theoretical development, is discussed in Section 6. 

2 Self Consistency: How Does It Work? 
2.1 Self-consistency: An Intuitive Principle 

To illustrate the self-consistency principle, imagine the following scenario. Mr. Littlestat was told 
by his boss to prepare a presentation on the growth in sales since the company's inception in 1993, 
and to make a prediction for the next couple of years. Eye-balling the 13 years of the data he 
was given, he felt that he could draw a reasonably-looking line, but he also knew that it would 
not please his boss. He vaguely remembered something called "least-squares line" from his college 
days, but that was all he could remember. So he asked his teenage son, who was a member of a 
high-school math club. Indeed, his son not only knew about the method, but actually had just 
programmed it for a homework problem. However, as his son was dashing out for a movie date, 
he briefly showed the program on his laptop to his father and said "Dad, I'm really running late. 
Just type in your data as two columns here, click that little thing, and you will find a drawing at 
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the printer." 

Not thrilled by his son's rushing but nevertheless pleased to have the program, Mr. Littlestat 
sat down and started to type in his data. Then he was really unhappy, as he found out that his 
son's program was hardwired specifically for solving the homework that had 16 data points. It 
simply would not run for Mr. Littlestat data set with 13 data points. There were three more rows 
needed to be filled in. 

"What should I do now?" Mr. Littlestat asked himself. He knew nothing about the method, 
nor how to modify his son's program. Nor did he have the patience to wait for his son's return, as 
he really needed to finish the preparation for his presentation tomorrow. 

As the desperation set in, Mr. Littlestat thought "Well, what if I just make up some numbers 
for the next three years, and see what happens?" So he did, and clicked. Instantly, a printout came 
out from the printer with a drawing of a line and the 16 points displayed, including the three fake 
sales figures. 

Excited, Mr. Littlestat examined the plot, and saw the line was visually a bad fit to the 13 
years of the real data. "Well, that's expected, since I put in some fake sales figures", he murmured 
to himself. But then it hit him: "since I made up these three points anyway, why don't I just make 
them to sit on this line and run the program again? That probably would be better ..." 

So he reentered the three fake sales figures by reading off the sales figures from the line, clicked 
again. The new printout showed a line fitting better for the past 13 years, but the three fake sales 
were still off from the new line, though they were a bit closer. "Hmmm, this is interesting". He 
was getting intrigued by his clever invention, "why not try this again?" So he reentered the three 
future sales by reading off the line again. 

Mr. Littlestat's excitement increased with the number of printouts, as the line fluctuated less and 
less, and the three future sales got closer and closer to the fitted line. Then everything stopped, 
and the three future figures sit on the line exactly, as far as Mr. Littlestat could tell. Visually 
inspecting all the plots he had, he exclaimed, "I guess this is it!" while holding the last plot. 

Although Mr. Littlestat was far from sure whether the line in his hand had anything to do with 
the least-squares line he was after, he convinced himself that he had done his best given what he 
was given, because there was really nothing else he could do — the line simply stopped moving, 
and that had to be its best in some sense for it could not be improved further. 

Mr. Littlestat was indeed right. His intuitive pursue actually had led to the correct answer. The 
limit of his iterative procedure is indeed the least-squares fit, a consequence of the "self-consistency" 
property, as we shall explain below. We provided this story to illustrate and emphasize that self- 
consistency is nothing more than good common sense. It does not always work, just as not all 
"common senses" would lead to good answers, but it is typically suggestive, and often leads to 
optimal solutions that can be justified mathematically. 

In the least-squares example above, Mr. Littlestat's method worked because the least-squares 
estimator is self-consistent in the following sense. Suppose we have a regression setting for which 
x is univariate: 

Ui = f3xi + €i, i = l,...,n, €i ~ i.i.d. F(0, a 2 ), 

where F(0, a 2 ) denotes a distribution with mean zero and variance a 2 . The least-squares estimator 
of f3 then is given by 

P n = P n (y 1 ,...,y n ) = ^= iyiX 2 i . (2) 

2^i=i x i 
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Then, for any m < n, as long as YH=m+i 

xj > 0, 

E (p n Vl, ■ ■ ■ iVm, P = Pm) = Pm- (3) 

That is, the least-squares estimator has a Martingale-like property, and reaches a perfect equilibrium 
in its projective properties. Therefore, we can obtain (3 m from a procedure for computing f3 n with 
n > m by solving (3) for the "fixed point" $ m . And this can be solved iteratively without knowing 
the form of [3 n as long as we can compute the average of the values of $ n over the conditional 
distribution as required by the left-hand side of (3). In this specific case, Mr. Littlestat's "line 
fitting" method is correct because of the linearity of p n in the yis. Specifically, starting with 
some initial guess, P$ say, at the t th iteration, we can impute the three "missing" j/j's by their 
conditional expectations yf^ = pf^Xi (i = 14, 15, 16), and then compute the next iterative estimate 

rf3 +1) =/3i6(yi,---,yi3,yS,yg,yg)- (4) 

Evidently, because of (2), the limit of (4), denoted by $13, must satisfies 

E!£l^ + /3l3E!£l4*l 



013 



-16 r 2 
A=l X{ 



which means that ^13 = X)J=i Vi x i/(J2i=i x 1)i the correct least-squares estimate with 13 data points. 



2.2 A Self consistent Regression Estimator 

Inspired by the least-squares estimator above, we propose to also invoke the self-consistency prin- 
ciple for regression function estimation in general when facing incomplete data. Note that although 
in this paper we focus on wavelet regression models, our proposal extends to more general non- 
parametric or semi-parametric regression setting. Specifically, let -Zobs and Z com be, respectively, 
the observed and (imaginary) complete data. Suppose for the moment that, given Z com , we have a 
method for computing the "best" estimate / com for our regression function /. We propose that / t, s , 
our estimate of / given Z G b s , under squared loss, to be the solution of the following self-consistent 
equation: 

E j/com(/)|-Zobs> / = /obs># = #obs} = /obs(')> ( 5 ) 

where 6 collects all nuisance parameters (such as the variance parameter); for notation simplicity, 
for the rest of this paper we will suppress, but not ignore, this conditioning on 9. Since (5), as 
demonstrated in Section 3, can be solved numerically via iterations, it provides a way to obtain our 
"best" incomplete-data estimator / Q b s by simply using the corresponding complete-data procedure 
that computes f com , much like the EM algorithm obtains the incomplete-data MLE via procedures 
for complete-data MLE. In doing so, no additional assumptions will be required, other then the 
necessary specification of the conditional distribution of the missing part of Z com given the observed 
Z bs- We note that such specifications are necessary as otherwise what is missing can never be 
recovered. 

The self-consistent equation (5) was also motivated by its success in estimating cumulative 
distribution function (CDF) with censored or truncated data, where / CO m( - ) will be the empirical 
CDF, a topic that has been studied extensively in the literature (e.g., see Efron 1967). Indeed, it 
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is well known that the Kaplan-Meier estimator is the solution to (5) for the right censored data 
under non- informative censoring, and it is also the generalized maximal likelihood estimator (e.g., 
Johansen 1978). In addition, similar ideas have also been successfully adopted to construct various 
nonlinear data summaries; e.g., see Tarpey & Flury (1996) and references given therein. Of course, 
the most spectacular success of the self-consistency principle is the EM algorithm (Dempster et al. 
1977) and its various generalizations (e.g., Meng & Rubin 1993 and Meng & van Dyk 1997) where 
the self-consistency principle, when applied to the complete-data score function, typically leads to 
incomplete-data maximum likelihood estimator. Indeed, our whole investigation as reported in 
this paper was guided closely by our knowledge and insights of the development of EM algorithm 
and its extensions, which turned out to be extremely fruitful for our current purposes. 



2.3 Heuristics 

In view of these great successes, it is natural to expect that our estimator / b s , the solution to 
(5), would possess excellent statistical properties. The following heuristics provides some insight 
and indication on why (5) can lead to efficient estimator under squared loss. Suppose f com is the 
complete-data optimal estimator that minimizes \\f — / C om|| 2 - Then, for any / bs 5 

II/ - fobs || = II/ - /com || +11 /com — /obs 1 1 

= 11/ — /com|| 2 + H/com — E(f com \ Z obs , /)|| 2 + || E(f com \ Z obs , f) — / bs|| 2 - 

Thus, minimizing ||/ - / bs|| 2 over / obs is equivalent to minimizing \\E(f com \Z ohs , f) - f obs \\ 2 , 
because the first two terms in the right hand side of the last equality are constants with respect 
to the minimization. It is thus natural to suspect that the solution of (5) is the minimizer of 
11/ — /obs II 2 1 at least asymptotically. 

There is also a Bayesian heuristics for (5). From a Bayesian view point, if / com is the Bayesian 
estimator for / given the complete data, then the Bayesian estimator given the observed data, 
under the squared loss, is / bs( - ) = E{f com (-)\Z ODS } . Although / ODS depends on the choice of prior, 
if it is an efficient estimate of the true /, it is reasonable to expect that asymptotically the ratio 
between the posterior expectation and the conditional sampling expectation evaluated at / = / Q b s , 
both of /com, converges (almost surely) to 1. That is, 

E \ /com(')l-^obs) / = /obs f E \ / com (•) | -Zobs , f = /obs f 

1 1 J - - 1, (6) 



/obs(0 E j/com(-)l^obs} 



which implies that (5) should hold at least asymptotically. 

We note that although the heuristics arguments above are appealing, we currently do not have 
rigorous theoretical results to establish the optimality of the self-consistent estimator. Indeed, even 
to prove the existence of the solution to the self-consistent equation (5) is a theoretical task that we 
have not been able to carry through. Our work, as reported in this paper, therefore has been more 
of an "engineering nature", focusing on constructing algorithms to solve (5) and to demonstrate 
empirically the good performance, conceptual and implemental simplicity, as well as the flexibility of 
the self-consistent approach. We hope that these empirical demonstrations show the great promise 
of this self-consistent approach, and thereby stimulate investigation of the theoretical properties, 
including optimality, of the self-consistent estimator. 
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3 Three Algorithms 



The self-consistent estimator / Q b s would not be of much practical value if (5) could not be solved 
with reasonable computational effort. To solve (5), two steps are involved. The first is to carry 
out the conditional expectation on the left-hand side, and the second is to solve the equation, in 
analogous to the E-step and M-step of the EM algorithm, respectively. However, unlike many 
common EM applications where the E-step is in closed form, in the wavelet applications, the exact 
E-step is typically analytically infeasible. This is because, due to the shrinkage operation, / com 
is a highly complicated non-linear function of the missing y's. There are two general approaches 
for dealing with such a problem. The first is to use a Monte Carlo E-step, as in Wei & Tanner 
(1990) and Meng & Schilling (1996), and the second is to trade the exactness for simplicity by 
making certain approximations to the conditional expectation. Below we propose three algorithms 
for computing or approximating / t, s , one of which is based on the first Monte Carlo approach, and 
the other two follow the second approximation approach. 

3.1 A Multiple Imputation Self-Consistent (MISC) Algorithm 

First we fix the notation. Define i" b s as the "observed data index set" : i 6 / Q b s if the iih. data point 
(xi, yi) in (1) is observed. Let y = (yo, . . . , yjv-i) T denote the complete responses, and let y mis and 
y ohs denote respectively the missing and observed portions of y. That is, y mis = {yi : i iobs} 
and y obs = {yi : i G I bs}- Define x obs = {xi : i G iobs}> and hence the observed data is 
Z bs = {#obs> 2/obs}- Also denote the covariance matrix of y as Cov(y) = S. Thus we allow the 
error terms ej's to be correlated, as long as £ can be identified and efficiently estimated from y obs . 

Our first algorithm, termed the multiple imputation self- consistent (MISC) algorithm, assumes 
that a complete-data wavelet regression procedure has been chosen (e.g., the SURE method of 

Donoho & Johnstone 1995). Starting with initial estimates /(°) and the algorithm iterates 

the following three steps for t = 1, . . .: 

Step 1 Multiple Imputation: For m = 1, . . . , M, simulate y m i S)m independently from 

P(y mis , m \y ohs ;f = f it - 1) ,V = ± it ~ 1) ). 

Step 2 Wavelet Shrinkage: For m = 1, . . . ,M, apply the chosen complete-data wavelet shrinkage 
procedure to the completed data y m = {y Q b s , y m i s m } and obtain fm(xi), i = 0, . . . , N — 1. 

Step 3 Combining Estimates: Compute the t-th iterative estimate of / as 

1 M 

f (t \ x i) = T7 E fm(xi), i = 0,...,N-l. (7) 
M m=i 

Also, use the residuals {yi — f^\xi) : i G iobs} to obtain an efficient estimate \ such as 
MLE, of S. 

In Step 1 above, the larger the M, the better results one would expect, but at the expense of 
increased computational time. Our numerical experience indicates that, as long as M is larger 
than a minimum cutoff, the additional improvement on / computed with a larger M is not largely 
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significant. In our numerical experiments, we typically used M = 100, though M = 10 is sometimes 
acceptable as well. 

It is evident that the above is a generic algorithm, in the sense that it is not restricted by 
the specific form of the complete-data wavelet regression procedure, nor by the choice of the error 
distribution. On one hand, this is a great advantage as it is extremely flexible and the additional 
programming, relative to that for the complete-data procedure, is minimal as long as it is easy to 
draw from the conditional distribution in the first step (which typically is the case for Gaussian 
errors, independent or not). It also provides a benchmark and basis for developing more specialized 
and sophisticated algorithms. On the other hand, it is a "brute force" algorithm, and is thus quite 
inefficient as a numerical algorithm. 

3.2 A Simple (Sim) Approximated Algorithm 

To construct a faster algorithm for computing / Q b s , we started with replacing the costly multiple 
imputation step in the MISC algorithm by a very simple analytical approximation. We label 
the resulting algorithm as the simple (Sim) algorithm, and it is designed for a specific type of 
shrinkage methods, namely, for hard thresholding methods for which the thresholding value is a 
known function g(a) of a, where a is an estimate of a. A classical example for g(a) is the universal 
thresholding scheme of Donoho & Johnstone (1994), for which g(a) = &\J2 log N. 

Starting with /(°) and M°\ the Sim algorithm iterates, at the t-th iteration, the following steps: 

Step 1 For each i such that i J b s , impute the corresponding missing t/i by y^f 1 = /^'^(xi), 
which creates a completed-data set: = {yi : i € -fobs} U {yf 1 : i i" bs}- 

Step 2 Apply a DWT to to obtain the empirical wavelet coefficients = Wy^. 

Step 3 Obtain a robust estimate a® of a from for example, the median absolute deviation 
method used by Donoho & Johnstone (1994). We call a® the unadjusted estimate for a. 

Step 4 Use the following variance inflation formula to obtain an adjusted estimate a® for a: 

&® = VV^ + C™^*- 1 )} 2 , (8) 
where C m = 1 — is the fraction of missing observations. 
Step 5 Compute by thresholding with the thresholding value g(a^). 

Step 6 Apply the inverse DWT to and obtain the t-th iterative estimate = W T w^ . 

In all our numerical experiments, convergence was declared if — a^^ 1 ^ \ / < e. Upon conver- 
gence, estimates of /, as well as a, will be obtained. It is obvious that computationally this Sim 
algorithm is much less intensive than MISC, because it only requires one complete-data wavelet 
shrinkage computation within each iteration, in contrast to the M sets of computation required by 
MISC. 

A key component of the Sim algorithm is the variance inflation formula (8), which takes into 
account the effect of those imputed yf^s on the estimation of a 2 . The formula was borrowed 
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from the EM algorithm for estimating a 2 with normal regression under independent errors, which 
requires replacing each missing yf by its conditional expectation 



E 



y ohs ;f = F- 1) y = {a^r 



{yf } } 2 + {^} 2 . 



Replacing the complete-data sufficient statistics J2i Hi an d J2i Ui i n the complete-data MLE for 
a 2 then lead to (8). Although we recognize that this adjustment may not be consistent with 
the method used for obtaining the unadjusted which often is not MLE, we adopt (8) for 
its conceptual and implemental simplicity. As demonstrated in Section 5, it works quite well 
in the sense that not carrying out this variance inflation adjustment would lead to noticeably 
poorer wavelet estimates. However, this variance inflation adjustment does not account for all the 
uncertainty in the thresholding due to missing data, and hence it does not work well when the 
percentage of missing data is large. A better approximation therefore is needed. 

3.3 A Refined (Ref) Fast Algorithm 

To simplify presentation, we will use single-indexing wi instead of the usual double-indexing Wjk 
notation to denote a wavelet coefficient. At the i-th step, we need to calculate 

f it) = E{f C or Il \y ohs J = f {t - 1) }, 

which amounts to calculating all 

4 ] = E{l\ wl \> g{ a ) w l \y ohs J = /C*- 1 )} , (9) 



where wi's and a are respectively the complete-data empirical wavelet coefficients and variance 
estimate. The simple algorithm in Section 3.2 took a very crude approximation of this conditional 
expectation, by thresholding the conditional expectation of wi with g{a) using the adjusted a. We 
can obtain better approximations if we are willing to give up some generality (but see Section 4.2). 

For example, under the i.i.d. normal error setting of (1), we can obtain a much refined ana- 
lytic approximation if we ignore the conditional variability in a when calculating the conditional 
expectation for (9). To proceed, we need to define the quantity rji: if wf^ is the empirical wavelet 
coefficient obtained from Step 2 of the Sim algorithm, then the conditional distribution of wi given 
{y obs ,ci 2 } is M(wf\rifa 2 ). With this setup and denote g(a) by a constant c to signify the fact 
that we assume it is known, we shown in Appendix A that (9) has the following simple analytic 
expression 

wf ] = a(wf ] , m ) + Piwjp , m) x , (10) 
where the a and (3 functions are given by 

TIG ( 1 / c-w \ 2 l ( c+w \ 2 ~\ / c — w\ ( C + w\ 

a(w,rj) = -!= <e 2 { ^ ) - e *\ n° ) \ and f3(w, rj) = 2 - $ - $ , (11) 

V27T I J V W J \ r)o J 

with $ being the CDF function of A/"(0, 1). 

The resulting algorithm is identical to Sim except that we replace its Step 5 by (10) and (11), 
where we use a = a® and c = g(a^). Thus, computationally, this new refined (Ref) algorithm is 
almost as efficient as Sim, and it is also straightforward to program as only standard functions arc 
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involved in (10) and (11). However, because it provides a much more refined E-step, the statistical 
efficiency of the resulting estimator is expected to be much closer to that of the MISC estimator 
with M = oo. Numerical results from Section 5 strongly support this expectation. 

The analytic formula (10)-(11) deserves several important remarks. First, under the assumption 
of i.i.d. noise error, the value of rji can be easily computed at the outset of iteration as the Ith. 
diagonal element of / — WRW T , where R is an N x N matrix whose off-diagonal elements are 
all zero, and whose Zth diagonal elements is one if yi is observed and zero otherwise; that is, the 
diagonal of R forms a "response indicator" vector for y. Note that the assumption of i.i.d. error 
is non-essential as long as the error covariance £ is known, which we do assume for the current 
approximation. Intriguingly, approximating all r^'s by their average, which is exactly the fraction 
of missing data C m as used in the variance inflation formula (8), works surprisingly well - see 
Section 4.3 for more discussion. 

Second, as a by-product, the quantity rji can be seen as a measure of the percentage of missing 
information in wi due to missing data. It is because < rji < 1, and that rji is one when there 
is no information in the observed data about wi and zero when wi is fully observed. Since the 
missing data here are the result of irregular design points, we can also view rji as a measure of 
irregularity in the data for the wavelet coefficient wi. Figure 1 provides some illustrative plots of 
rji, and demonstrates well that the effects of the missing data, as expected, are localized. Also, the 
plots indicate that these missing data seem to have a stronger impact on high frequency wavelet 
coefficients. We believe that this interesting by-product, that is, the "irregularity plot" (i.e., by 
displaying the rji 's on a frequency-location plot, as we do with the empirical wavelet coefficients), 
is worthy of further exploring, for example, for the purpose of diagnosing and determining the 
suitability of a particular wavelet regression model for a particular irregular design. 

Third, an intriguing insight suggested by (10) is that even when we choose to use hard threshold- 
ing with complete data, if we adopt the self-consistency recipe (5), we should use "soft" thresholding 
with incomplete data, as (10) is a form of soft thresholding. This can be seen from the fact that as 
long as rji > 0, < \wf\ < \wf^\, as implied trivially by (9). It can also been seen partially from 
the fact that as long as i] > 0, < f3{w,rj) < 1, and therefore it provides an "regression/shrinkage" 
effect. When rj — ► 0, a(w,rj) — > and /3(w,rj) — > 1|,„|> C , and thus (10) goes back to the original 
hard-thresholding, as it should be. 

Fourth, although the updating expressions (10) and (11) for wf^ were derived for the hard 
thresholding operation, it can be easily modified for soft thresholding: l(| TOi |> c )sign(^){|i(;j| — c}. 
By similar calculations as in Appendix A, for soft-thresholding, we only need to add a simple term 
to wp of (10), relabeled as wf\ ard . That is, we replace (10) by 

Our numerical experiments suggest that the uses of the hard and the soft thresholding operations 
inside the Ref algorithm give similar practical performances, perhaps a reflection that both are forms 
of "soft" thresholding after all in the presence of incomplete data. For this reason, for the rest of this 
paper we shall concentrate on the hard thresholding operation in our numerical experimentations. 
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Figure 1: Frequency-location plots of ^/rji. Top left: test function Bumps; top right: wavelet coefficients 
of Bumps] bottom left: plots of ^/rji with three missing observations; bottom right: plots of ^/rji with 10% 
missing observations. In the r\i plots those triangles at the bottom indicate locations of missing observations. 
Note that ^/rji is plotted instead of m to enhance visibility. 



4 Modifications and Extensions 

4.1 Incorporating Smoothness Assumptions 

As we shall see from the simulation results in Section 5, when comparing to the interpolation 
based methods, the self-consistency based procedures produce superior estimates for f(xi) if i £ 
Jobs, but tend to produce inferior estimates when i iobs- The reason for this is as follows. 
Whenever an interpolation is employed to fill in the missing responses {yi : i iobs}, some kind 
of smoothness constraint is effectively imposed on {/(xj) : i -fobs}- However, our basic self- 
consistency procedures do not impose any prior smoothness assumptions on {f(xi) : i iobs}- 
Since many regression functions are mostly smooth, it is reasonable to expect that methods that 
do not impose any smoothness constraints on {/(xj : i / bs} tend to produce inferior estimates 
for {f(xi) : i iobs} than those methods that do impose such constraints. This is similar to 
comparing the maximum likelihood procedure with a less efficient estimation procedure but with 
a good constraint, or prior. The latter can be better than the former, not because it is better in 
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retaining the relevant information in the data, but because of the reasonable constraint or prior 
information. 

This analogy also suggests that, if such a smoothness assumption is sensible, then this assump- 
tion should be included in the self-consistent procedures; that is, to take advantage of both the 
information available in the data and in the prior. For example, if linear interpolation is a good 
smoothing procedure to use, then the MISC, Sim, and Ref algorithms can be further improved by 
modifying at the end of each iteration: 

•k For each i I bs, replace f^\xi) by the following linearly interpolated value 

/ .. )(Xa) + /" ) ^)-/"'fa) (li - Xa) , (13) 

Xfr x a 

where x a G x Q ^ s is the largest observed design point that is less than Xj, and xj, G x Q ^ s is the 
smallest observed design point that is larger than Xi. 

The above linear interpolation rule, of course, can be replaced by many other interpolation rules 
if the corresponding smoothness assumptions are more appropriate. As demonstrated in Section 5, 
when such smoothness assumptions are appropriate, these hybrid algorithms outperform both the 
pure interpolation methods and the pure self-consistent methods in terms of the mean squared 
errors integrated over both i G i" bs and i / bs- 

4.2 Non— Gaussian Errors 

As emphasized before, one of the main advantages of adopting the self-consistency equation (5) is 
that it is not restricted to any particular model or error distribution, much like the EM algorithm 
is not restricted to a particular class of models. Indeed, the MISC algorithm is completely general, 
and can be easily implemented to obtain estimators under non-Gaussian errors: simply replace 
the Gaussian noise wavelet shrinkage procedure in Step 2 of MISC with a suitable non-Gaussian 
shrinkage method. A numerical demonstration with the Poisson model will be given in Section 5.4. 

Given specific non-Gaussian thresholding rules, we can also derive analytic approximations to 
the conditional expectation needed by (5), under a specified non-Gaussian model. That is, we can 
obtain refined algorithms for various other error structures, including correlated errors. It is not 
possible to give a general recipe here as how to perform such approximations, as they need to be 
worked out on a case by case basis. We emphasize that this should not be viewed as a disadvantage 
of the self-consistency method, because it is a general principle for constructing estimators, much 
like the implementation of the EM algorithm always calls for individual derivations of its E-step 
(and M-step) under the model of interest. 

4.3 Two— Dimensional Settings 

In many imaging applications, especially in remote sensing area, due to detector malfunction or 
some other reasons, the readings of a small fraction of image pixels are missing. These missing 
values forbid the direct use of wavelet techniques for image reconstruction. This is an ideal setting 
for the self-consistent procedures, as it is automatically an incomplete design problem and the 
percentage of missing data tends to be small. 

The generalizations of our methods from ID settings to 2D settings are in fact quite trivial. 
Indeed, for all of the above algorithms, the only modification needed is to replace the ID DWT 
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with a 2D DWT. As for the ID case, one would expect that the MISC algorithm would give the 
best results, followed by Ref and then Sim. However, due to its computational cost, the 2D MISC 
algorithm may not be practical. Even for the 2D Ref algorithm, the calculations of rji in (10)-(11) 
can be lengthy, because one needs to calculate individual elements of a 2D DWT matrix W. A 
very simple approximation is to replace all rji's by their average, which is 



This exceedingly simple approximation turned out working surprisingly well in all our simulation 
studies, for both ID and 2D cases, and therefore we recommend its use whenever the more refined 
calculations or approximations of rji are too expensive. 

Lastly we remark that in the image restoration or similar contexts, the incorporation of an 
off-the-shelf interpolation step often is not a good idea because real images tend to contain a large 
amount of discontinuities (e.g., edges). 

5 Numerical Experiments 

This section reports results of five sets of numerical experiments that were conducted to study the 
empirical properties of the above methods. Throughout this section, the D5 wavelet of Daubechies 
(1992) was used as the mother wavelet, and the primary resolution was 3. 

5.1 Visual Inspection 

Our first numerical experiment was simply to see if the MISC (Section 3.1), the Sim (Section 3.2) 
and the Ref (Section 3.3) algorithms, without using interpolation, would work at all. We used 
the four well-known testing functions of Donoho & Johnstone (1994): Heavisine, Doppler, Blocks 
and Bumps. For each of the test function, we first simulated a regularly-spaced noisy data set as 
in (1), with N = 2048 and signal-to-noise ratio (snr) 7. We then randomly deleted 30% and 50% 
of the observations and applied the three algorithms to reconstruct the curve, using the universal 
thresholding. 

The results for Heavisine are displayed in the first two columns of Figure 2 respectively. Each 
column in Figure 2 plots, from top to bottom, the noisy incomplete data set; the initial estimate 
/(°) obtained using the S-Plus function lowess with a 10% smoothing span; the first and the third 
iteration estimates /W and f^; the final estimate declared as soon as reaching — a^^ 1 ^ \/&^ < 
0.0001; and the plot of log(MRSS b s ) and log(MSE b s ) against the iteration step t. Here MRSS b s 
is the mean residual sum of squares and MSE Q b s is the mean squared error calculated over all the 
observed design points: 



Visually, we observe that the Sim algorithm produces good estimate when C m = 30% (first 
column) , and slightly worse estimate when C m = 50% (second column) . This worsening is expected 
because with 50% missing data there is a lot more uncertainty that could be adequately captured 
by the variance inflation formula (8). However, it still produces a much better result than the 



4trace(/ - WRW T ) = 1 - = C, 



(14) 



MRSS bs = - 



n 
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naive approach that directly uses the unadjusted a 2 in Step 4, namely, by setting C m = in (8), 
as displayed in the third column. Results of the MISC algorithm, with M = 100, is displayed 
in the fifth column for C m = 50%. This MISC algorithm does give a smaller MSE than the Sim 
algorithm, but at an expense of increased computational time. It is because the MISC algorithm 
takes more iterations and each iteration is roughly M = 100 times more expensive. A very exciting 
observation is that the Ref algorithm provides essentially identical results as MISC, as displayed in 
the fourth column, but with a computational load (especially when approximating all ?#'s by C m ) 
almost identical to that of the Sim algorithm. 

All the observations above were almost identically replicated with the other three testing func- 
tions; due to space limitation we omit these displays. For all these combinations, the Sim and the 
Ref algorithms used 0.5 to 3.5 seconds user time on a Sun Ultra 60 machine, depending on the 
testing function. For the MISC algorithm, it took about 100 x T times longer, where T is the ratio 
of the number of iterations under MISC to that of Sim or Ref (T typically varies from 2 to 5). 

5.2 Effects of Interpolation and Approximations 

Our second numerical experiment was conducted to study the effects of the different approximations 
used in the Sim and the Ref algorithms, using the MISC algorithm as a benchmark for comparison. 
The effects of interpolation is also studied. The same four testing functions were used. Other 
experimental factors are: N = 512, snr = (5,7) and missing percentage C m = (10%, 30%, 50%). 
For each combination of testing function, snr and missing percentage, 100 incomplete data sets 
were generated. Then for each of these 100 generated data sets, eight algorithms were applied to 
estimate /: 

1. MISC, 

2. MISCI - MISC with the interpolation step (13); 

3. Sim, 

4. SimI — Sim with (13); 

5. Ref, 

6. Refl — Ref with (13); 

7. RefA — Ref with all r^'s approximated by C m , namely, by (14), and 

8. RefAI — RefA with (13). 

Instead of using the universal thresholding value a\/2 log N, in this experiment we follow Antoniadis 
& Fan (2001) and used o^j2\ogN — log(l + 256 log N). Antoniadis & Fan (2001) showed that this 
latter thresholding value is superior to the universal thresholding value. 

Figure 3 presents the boxplots, from top to bottom, of MSE com , MSE b s , and MSE m i s , where 
each column corresponds to a testing function. Here MSE b s is the same as in Section 5.1, and 
MSE m i s and MSE com are its counterparts summing over respectively all the missing design points 
and all the design points. The fraction of missing data is C m = 30% and snr = 7. Results for 
C m = (10%, 50%) and snr = 5 are similar and hence are omitted. 

From the boxplots the following empirical observations can be made: 
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Figure 2: The performances of the Sim, Ref, and MISC algorithms for recovering Heavisine. For the second 
to the fourth rows, solid curves are regression estimates and the dotted curve is the true function. See text 
for further details. 
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• Algorithms with interpolation are superior to their counterparts. This illustrates the impor- 
tance of employing suitable prior information in wavelet regression, and the above results 
helped to identify this importance by separating the efficiency inherited in the observed data 
from the prior information implicitly built into interpolation procedures. 

• When comparing results from Ref and RefA, and from Refl and RefAI, the rj approxima- 
tion (14) does not seem to have any adverse effects on Ref or Refl. 

• The performance of those Ref-type algorithms are very similar to the two MISC algorithms. In 
fact for many experimental configurations, results from formal statistical tests (not reported 
here) suggest that the difference of the MSE values of RefAI and MIS CI are statistically 
insignificant. 

Given these observations, RefAI seems to be the best compromise, both in terms of statistical 
performance and computational speed. 
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Figure 3: Boxplots of MSE com , MSE b s and MSE m ; s for the eight algorithms tested in Section 5.2. 
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5.3 Comparisons with Existing Methods 

In this third experiment the algorithms SimI and RefAI are compared to two popular wavelet 
regression procedures for non-equispaced data. These two procedures are the IRREGSURE pro- 
cedure of Kovac & Silverman (2000) and the ROSE procedure of Antoniadis & Fan (2001). The 
procedure IRREGSURE is an interpolation based procedure where the SURE method of Donoho 
& Johnstone (1995) is employed as the thresholding procedure. The ROSE procedure is a one-step 
iterative procedure that uses penalized least squares and hard thresholding. Again, for SimI, RefAI 
and ROSE, the thresholding value used was ay / 2\ogN — log(l + 256 log A"). 

We first present our results for those cases with N = 512. For each combination of testing 
functions, snr = (5,7) and C m = (10%, 30%, 50%), 200 noisy data sets were generated. For each 
noisy data set, the four procedures SimI, RefAI, IRREGSURE and ROSE were applied to obtain 
estimates for /, and their corresponding MSE com , MSE Q b s , and MSE m j s were computed. In addition, 
for the reason of providing a benchmark comparison, the universal thresholding procedure with 
threshold a^2 log N - log(l + 256 log N) was also applied to the complete noisy data set. We will 
label this procedure UniComp. As the complete data set was available to UniComp, it is expected 
that UniComp would produce smaller MSE values than the other four procedures. For the case 
snr = 7 and C m = 30%, boxplots of the MSE com , MSE b s and MSE m i s values for the five methods 
are displayed in Figure 4 in a similar fashion as before. Results under snr = 5 and C m = (10%, 50%) 
are similar and hence are omitted. 

Pairwise Wilcoxon tests were applied to test if any two of the four procedures have significantly 
different median values for MSE com , MSE Q b s and MSE m i s . The significance level used was 1.25% 
because of Bonferroni correction for multiple comparisons. Based on the testing results, we ranked 
a procedure first if its median MSE value is significantly less than those of the remaining three, we 
ranked it second if its median MSE value is significantly less than two but greater than the remaining 
one, and so on so forth. If the median MSE values of two procedures are not significantly different, 
they will share the same averaged rank. These rankings are listed in Table 1. While no ranking 
method is perfect, such a ranking does provide a good indicator of the relative merits of the methods 
being compared. Rankings under snr = 5 are almost identical, and thus are omitted. 

From Figure 4 and Table 1 one may conclude that RefAI is generally superior to the other 
three procedures, SimI and IRREGSURE are roughly the same, while ROSE is inferior. A partial 
explanation for the relatively poorer performance of ROSE is that it does not employ interpola- 
tion, indicating potentially misleading comparative conclusions if we do not distinguish between 
the information from the data and that from (implicitly) imposed smoothness assumptions. By 
examining the boxplots, one can see that, when comparing the MSE Q b s values, the performance of 
RefAI is in fact very similar to UniComp. The above experiment was repeated for N = 2048, but 
without the ROSE procedure. The relative rankings of RefAI, SimI and IRREGSURE remain the 
same. 

5.4 Poisson Model 

Displayed in the top panel of Figure 5 are the Poisson photon counts, captured at 256 time in- 
tervals, from the collapsed star RXJ1856. 5-3754, which is about 400 light years from Earth in the 
constellation Corona Australis. Note that in this plot the time index has been re-scaled to [0,1]. 
One reason that astrophysics scientists are interested in this collapsed star is that they believe its 
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Figure 4: Boxplots of MSE com , MSE b s and MSE m ; s for the four procedures compared in Section 5.3. 



matter is even denser than nuclear matter, the most dense matter found on Earth. We refer inter- 
ested readers to http://chandra.harvard.edu/photo/2002/0211/index.html for the scientific 
issues surrounding this data set. 

From this Poisson data set the following two smoothed curve estimates are obtained. The first 
one was constructed from the complete data set while the second one was constructed with the 
presence of missing data. The complete data reconstructed curve was obtained by applying the 
Poisson wavelet thresholding rule of Kolaczyk (1999, Equation (13)) to the full data set, and it is 
displayed as the solid line in the bottom panel of Figure 5. The missing-data curve estimate was 
obtained as follows. Firstly 5% of the data points were removed. These missing data points are 
marked by "o" in the top panel and their locations are indicated by "x" in the bottom panel of 
Figure 5. Then the MISC algorithm was applied to this missing data set with the same Poisson 
thresholding rule and M = 100. The resulting curve estimate is the broken line in the bottom 
panel of Figure 5. 

From these plots one could notice that in regions with no missing data (e.g., for t in [0,0.15] 
and [0.6,0.75]) or when the values of the missing data are not local extrema (e.g., at t « 0.26 
and t ~ 0.76), the complete-data and the missing-data estimates are virtually the same. However, 
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Table 1: Wilcoxon rankings for the four wavelet regression procedures compared in Section 5.2 when 
N = 512. 

the two estimates do have small differences at regions where missing data are clustered (e.g., at 
t ps 0.44) or when the values of the missing data are local extrema (e.g., t ps 0.19 and t ps 0.79). This 
simple example therefore illustrates both the feasibility of the MISC algorithm for non-Gaussian 
data, and the fact that wavelets methods have the ability to localize the damage caused by the 
incomplete observations. 

5.5 Image Denoising 

In this last experiment we explore the performance of our algorithms in the context of image 
denoising. We are only aware of a very limited number of existing methods that are specifically 
designed to perform image denoising with missing data. One method was described by Naveau 
& Oh (2004), which although can be applied to handle missing values around the image edges, 
was primarily proposed to reduce boundary effects. Their updating algorithm is similar to our 
Sim algorithm, but without the variance inflation formula (8), and therefore inferior results are 
expected. The other is by Hirakawa <fe Meng (2006) using an EM-type approach for simultaneous 
demosicing and denosing, which in fact was motivated by our current work. 

Three 2D algorithms were studied: the MISC (with M = 10), the Sim, the RefA procedures. 
Two testing images of size 256 x 256 were used: the well-known Lena image displayed in Figure 6(a) 
and the Airplane image displayed in Figure 6(b). Also, two snrs and three missing data percent- 
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Figure 5: Top panel: photon counts from the collapsed star RXJ1856. 5-3754. Circles indicate (artificial) 
missing observations. Bottom panel: reconstructed curves using the complete data (solid line) and missing 
data (broken line). The crosses mark the locations of the missing data. 

ages were tested: snr = (5,7) and C m = (10%, 30%, 50%). Lastly, two missing data formation 
mechanisms were tested. The first one is missing at random, in which missing pixel locations were 
randomly selected from the image, while in the second mechanisms the missing pixels were clus- 
tered together. Note that because of the computational cost, it was too costly to run MISC with 
M = 100 for our simulation studies, which typically takes roughly 1 hour for one replicate on a 
Sun Ultra 60 machine. 

For each of the above experimental factor combinations, 100 noisy images were generated, and 
the above three algorithms were applied to reconstruct the corresponding true images, using the 
adjusted universal thresholding value: &y/2 log N — log(l + 256 log N). As to provide a benchmark 
for comparison, for each noisy image, we also applied the universal denoising method (Donoho & 
Johnstone 1994), with the same adjusted thresholding value, to reconstruct the corresponding true 
image using the complete data. As before, we refer this method as UniComp. As UniComp has 
the full information from y, it is expected that it would produce better reconstructed images than 
the other three algorithms. 

For every reconstructed images, we calculated MSE com , MSE b s and MSE m i s as measures of 
reconstruction quality. (We are aware of the fact that MSE is not a good measure for visual 
quality, but in the absence of a commonly agreed measure for visual quality, the MSE still serves as 
a statistically useful criterion for comparisons). In addition we also computed the following MSE 
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ratio: 

r fMISC) - MSEcom ° f MISC 
comi j " MSE com of UniComp- 

Similar MSE ratios for the observed (r b s (MISC)) and missing data (r m i S (MISC)), and for the Sim 
and RefA algorithms (r com (Sim), r obs (Sim), r mis (Sim), r com (RefA), r obs (RefA) and r mis (RefA)) 
were also calculated. Since UniComp reconstructed the images with the complete data, it is ex- 
pected that all these MSE ratios are bigger than 1. For snr = 7 and C m = 30%, boxplots of these 
MSE ratios are given in Figure 7. Boxplots for other experimental settings are similar and hence 
omitted. From Figure 7 some major empirical conclusions can be obtained. 

First, as all r b s (MISC) values are fairly close to 1, the easy-to-implement benchmark MISC 
algorithm performs reasonably well for those observed pixels. Secondly, the RefA algorithm is su- 
perior to the other two algorithms, as it does not require multiple imputation (as opposed to MISC) 
and it uses a better approximation than the Sim algorithm. Lastly, an unexpected observation is 
that, r Q bs(RefA) is in fact less than 1 when the locations of the missing data are clustered together. 
We currently do not have an explanation for this phenomenon, other then noting that hidden biases 
resulting from model defects can be more pronounced with more data. 

For the purpose of visual inspection, two degraded versions of Lena are displayed in Figures 8(a) 
and 9(a). Those black pixels represent the locations of the missing values. The snr is 7 and the 
missing percentage is 10%. Figures 8(b) and 9(b) display the corresponding reconstructed images 
obtained from the RefA algorithm. The quality of the reconstructed ones is quite acceptable. The 
one with clustered missing data is particularly impressive, especially considering that the method 
we used did not take into account the cluster nature of the missing data. Reconstruction algorithms 
as such are particularly useful for image inpainting (e.g., see Criminisi, Perez Sz Toyama 2004 and 
Tschumperle & Deriche 2005). 



6 Summary and Future Works 

A main goal of this paper is to demonstrate that the self-consistency principle is a very versatile and 
fruitful method for dealing with wavelet modeling, and more generally with non-parametric and 
semi-parametric regressions, when facing incomplete data. By viewing irregularly-spaced data as a 
form of incomplete data, it also provides a rather general methodology for wavelet reconstructions 
with irregularly-spaced data by taking advantage of the existence of those well-studied methods 
developed for regularly-spaced data, much in the same spirit as with the EM algorithm or multiple 
imputation. Indeed, the specific algorithms we proposed here directly use either multiple imputation 
or steps very similar to the E-step (and M-step) of the EM algorithm. 

Much remains to be done, of course. The most urgent ones are theoretical properties of the es- 
timators and algorithms we propose. Simulations were crucial in our development of the estimators 
and algorithms given in this paper, but they are no substitute of rigorous theoretical investigations. 
It is therefore our hope that our empirical findings are convincing, or at least suggestive, enough 
that they would motivate a general theoretical investigation of the self-consistent wavelet estima- 
tors, or more generally self-consistent regression estimators. It should also be of great interest 
to investigate the theoretical connections between the self-consistent wavelet estimators and other 
constructions of wavelet estimators with irregularly-spaced data, such as via lifting (e.g., Delouille 
et al. 2004 and Nunes et al. 2006). 
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(a) Lena 



(b) Airplane 



Figure 6: Testing images used in the image denoising experiment. 

On the methodological side, besides developing even more refined algorithms, especially for 
non-Gaussian and correlated errors, a logical next step is to combined the self-consistent principle 
with Bayesian methods. Indeed, with Bayesian methods, the dealing with missing data is done 
jointly with the inference of parameters and regression functions. Preliminary work has shown 
great promise, as reported in Hirakawa &i Meng (2006). We are currently investigate the self- 
consistent approach with a number of Bayesian methods for wavelet reconstructions, including 
over-complete expansions (e.g., Abramovich et al. 1998, Johnstone & Silverman 2005 and Donoho, 
Elad & Temlyakov 2006). 
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A Derivations of (10), (11) and (12) 

To show (10) and (11), let wi ~ M(d,T 2 ) and write Z = m ^ L ; i.e., Z is standard normal. We will 
also need the following fact. Let Z ~ Af(0, 1), and denote its probability density function by 4>(z). 
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Figure 7: Boxplots of the MSE ratios resulted from the image denoising experiment in Section 5. In each 
panel the left, middle and right boxplots correspond, respectively, to the MISC, Sim and RefA algorithms. 



Then, for any constant c, E{1^ Z>C ^Z} = r/>(c) and E{1( Z< _ C )Z} = —(j)(c). 
With this setup, we have 

E{l(\ Wl \> c )Wi} = E{l {d+TZ > c) {d + TZ)} + E{l {d+rZ <_ c) {d + TZ)} 

= dp(z> + ^{1(^^,2} + dp[z< d£±Q j + T E{l {z< - (c+d)) Z} 
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Now by substituting d = wf^ and r = rjia into the above expression, we obtain (10) and (11). 

Equation (12) is derived using essentially the same steps as above, except the whole calculation 
begins with £ , [l ( -| u , ; |> c )sign(^i){|zi;/| — c}], which only differs from the hard-thresholding formula by 
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(a) Degraded Lena 



(b) Reconstructed Lena 



Figure 8: Degraded (a) and reconstructed (b) Lena when the pixels are missing at random. 

the simple term —cE{l^ w ^ >c -^sign(wi)} = —c{P(wi > c) — P(wi < — c)}. This can be easily seen 
to be corresponding to the second term on the right-hand side of (12). 
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